Executive Summary


Problem: Evaluating NFL kickers accurately presents a significant challange due to the low sample sizes and varying contexts of field goal attempts, and is even more challenging to do midseason. Tradditional metrics often struggle to provide insights into true player skill when sample sizes are limited.

Solution: This is a proposed Bayesian hierarchical model designed to evaluate the performance of NFL kickers. The model accounts for both individual player skill, and the non-linear relationship between kick distance and the probability of success. By estimating unique player effects and then applying these predictions to a standardized set of simulated kicks (25-60 yards), I derive a contextualized and stable evaluation metric.

Key Findings: The Simulated Field Goal Over Expected (FGOE) metric, generated by this model, offers a superior and more reliable assessment of kicker performance compared to traditional observed metrics. This approach successfully mitigates the noise introduced by small sample sizes, enabling better comparisons across players. For example, this analysis demonstrates how the Simulated FGOE accurately positions elite kickers like Justin Tucker at the top, even when their traditional FGOE is skewed by things such as blocked kicks.

Impact: This model provides a way to quantify kicker performance. It not only offers a clear “rating” metric, but also quantifies the uncertainty surrounding each player’s estimated skill through credible intervals. This allows decision-makers to distinguish between genuinely high-performing kickers and those whose performance might be inflated by favorable circumstances or limited data. Players with fewer historical kicks will naturally show larger uncertainty bands, providing a realistic view of the confidence in their evaluation, while consistently strong performers will show tighter, higher-valued estimates. These insights empower scouting teams and analysts to make more informed player personnel decisions.

Contents

The Model

Note

Any code not directly related to the model or derived metrics, such as plotting functions or library imports, have been collapsed to improve readability. You can expand these code blocks by clicking on them.

All necessary library imports and custom utility functions for data preparation, modeling, and plotting are defined below:

Code
import polars as pl
from patsy import dmatrix
import pymc as pm
import numpy as np
import arviz as az
import pandas as pd
from plotnine import *
from pathlib import Path
import matplotlib.pyplot as plt

import polars.selectors as cs

from great_tables import GT, md, html


palette = ["#001e42", "#535c78", "#9fa3b3", "#f1f1f1", "#ffbea4", "#ff895a", "#ff4b00"]


def player_index(df: pl.DataFrame) -> pl.DataFrame:
    """
    Unique player names and id's
    """
    return (
        df[["player_id", "player_name"]]
        .unique()
        .with_columns(player_idx=pl.col("player_id").to_physical().cast(pl.Int64))
    )


def prep_dataset() -> pl.DataFrame:
    """
    Create the base dataframe combining the two data sources

    Returns:
        pl.DataFrame: dataframe of field goal attempts with kicker info
    """
    df_field_goals = pl.read_csv("data/field_goal_attempts.csv")

    df_kickers = pl.read_csv("data/kickers.csv")

    df = (
        df_field_goals.join(df_kickers, on="player_id")
        .rename({"attempt_yards": "distance"})
        .with_columns(
            success=pl.when(pl.col("field_goal_result") == "Made")
            .then(pl.lit(1))
            .otherwise(pl.lit(0)),
            player_id=pl.col("player_id").cast(pl.Utf8).cast(pl.Categorical),
            player_name=pl.col("player_name").cast(pl.Categorical),
            season_type=pl.col("season_type").cast(pl.Categorical),
        )
    ).with_columns(
        distance_std=(pl.col("distance") - pl.col("distance").mean())
        / pl.col("distance").std()
    )

    return df


def make_pymc_data(df)-> dict:
    """
    The pymc model will use 2 arrays passed into it, along with coordinates. This
    function creates the dictionary of data needed to pass into the model. The contexts
    of the spline are also captured to use on new prediction data.

    Returns:
        dict: distance, player info, spline info
    """
    spline_basis = dmatrix(
        "bs(distance_std, df=3, include_intercept = False) - 1",  # removed include_intercept=True) - 1",
        data=df.to_pandas(),
        return_type="dataframe",
    )

    return {
        "distance": df["distance_std"].to_numpy(),
        "spline_array": np.asarray(spline_basis, order="F"),
        "player": df["player_id"].to_physical().cast(pl.Int64).to_numpy(),
        "player_cat": df["player_id"].cat.get_categories(),
        "spline_shape": np.arange(spline_basis.shape[1]),
        "spline_basis": spline_basis,
    }


def get_jt(df) -> int:
    """
    Plotting helper function to highlight Justin Tucker

    Returns:
        int: index of Justin Tucker
    """
    shape = df.shape[0]
    var = df.with_row_index().filter(pl.col("player_name") == "Justin Tucker")["index"][
        0
    ]

    return shape - var


def pointrange_data(
    df: pl.DataFrame | pd.DataFrame,
    field: str = "fgoe",
    group: list = ["player_id", "player_name"],
) -> pl.DataFrame:
    """
    Returns `arviz.hdi()` values from a long dataframe. These values are used to
    produce high density credible intervals for plots.

    Args:
        df: dataframe that contains multiple records per group
        field: (str, optional) field to aggreagate hdci on
        group: (list, optional) fields to group aggregations by

    Returns:
        pl.DataFrame: dataframe with 1 row per group and hdci intervals
    """

    if isinstance(df, pd.DataFrame):
        df = pl.from_pandas(df)

    def hdi_bounds(values, prob=0.9):
        hdi_result = az.hdi(values.to_numpy(), hdi_prob=prob)
        return hdi_result[0], hdi_result[1]

    def hdi_bounds_50(values):
        hdi_result = az.hdi(values.to_numpy(), hdi_prob=0.5)
        return hdi_result[0], hdi_result[1]

    data = (
        df.group_by(group)
        .agg(
            fgoe=pl.col(field).median(),
            hdi_90=pl.col(field).map_elements(
                lambda x: hdi_bounds(x, 0.9), return_dtype=pl.List(pl.Float64)
            ),
            hdi_50=pl.col(field).map_elements(
                lambda x: hdi_bounds_50(x), return_dtype=pl.List(pl.Float64)
            ),
        )
        .with_columns(
            ymin=pl.col("hdi_90").list.get(0),
            ymax=pl.col("hdi_90").list.get(1),
            hdi_lower=pl.col("hdi_50").list.get(0),
            hdi_upper=pl.col("hdi_50").list.get(1),
        )
        .drop(["hdi_90", "hdi_50"])
        .with_columns(
            highlight=pl.when(pl.col("player_name") == "JUSTIN TUCKER")
            .then(pl.lit(palette[-1]))
            .otherwise(pl.lit(palette[1]))
        )
        .sort("fgoe", descending=True)
    )

    return data


def title_name(df: pl.DataFrame | pd.DataFrame):
    """
    Title case names, primarily for plots.
    """
    if isinstance(df, pl.DataFrame):
        return df.with_columns(pl.col("player_name").cast(pl.Utf8).str.to_titlecase())
    elif isinstance(df, pd.DataFrame):
        return df.assign(player_name=lambda x: x.player_name.str.title())
    else:
        raise ValueError


def ggpointinterval(
    df: pl.DataFrame, title: str, axis_label: str, figure_size:tuple=(6, 8), axis_text_size:int=10
) -> ggplot:
    """
    Plot of the distributuion of a metric and multiple credible intervals for
    each player.

    Args:
        df: (pl.DataFrame)
        title: (str) title of the plot
        axis_label: (str) text of the evaluation metric
        figure_size: (tuple, optional) figure size of plot
        axis_text_size: (int, optional) text size of plot

    Returns:
        plotnine.ggplot
    """
    df_plot = pointrange_data(df)
    df_plot = title_name(df_plot)
    return (
        ggplot(
            df_plot,
            aes("reorder(player_name, fgoe)", "fgoe", color="highlight"),
        )
        + geom_pointrange(aes(ymin="ymin", ymax="ymax"))
        + geom_linerange(aes(ymin="hdi_lower", ymax="hdi_upper"), size=1.5)
        + coord_flip()
        + scale_color_identity()
        + labs(
            title=title,
            y=axis_label,
        )
        + theme_minimal()
        + theme(
            legend_position="none",
            axis_title_y=element_blank(),
            figure_size=figure_size,
            dpi=100,
            panel_grid_minor=element_blank(),
            panel_grid_major_y=element_blank(),
            axis_text_y=element_text(size=axis_text_size),
        )
        + geom_hline(yintercept=0, linetype="solid")
        + geom_vline(xintercept=get_jt(df_plot), linetype="dotted", color=palette[-2])
    )

Data

The data contains kicks for multiple seasons, where each kick event is a row. Each kick has a possible of 3 outcomes: Made, Missed, Blocked. The field goals that were Made are counted as a success for modeling purposes.

df = prep_dataset()
df.head()
shape: (5, 14)
season season_type week game_date game_key play_id play_sequence player_id field_goal_result distance player_name birthdate success distance_std
i64 cat i64 str i64 i64 i64 cat str i64 cat str i32 f64
2010 "Pre" 1 "8/8/2010" 55073 433 17 "34623" "Made" 20 "DAVID BUEHLER" "2/5/1987" 1 -1.723446
2010 "Pre" 1 "8/8/2010" 55073 2661 104 "34623" "Missed" 49 "DAVID BUEHLER" "2/5/1987" 0 1.079384
2010 "Pre" 1 "8/8/2010" 55073 2772 109 "34623" "Made" 23 "DAVID BUEHLER" "2/5/1987" 1 -1.433498
2010 "Pre" 1 "8/8/2010" 55073 1604 64 "34623" "Made" 34 "DAVID BUEHLER" "2/5/1987" 1 -0.370356
2010 "Pre" 2 "8/12/2010" 55076 3086 123 "34623" "Made" 28 "DAVID BUEHLER" "2/5/1987" 1 -0.950251

Success Rate

The plot below shows the success rate by distance.

Code
df_plot = df.group_by("distance").agg(pl.col("success").sum() / pl.count("success"))
(
    ggplot(df_plot, aes("distance", "success"))
    + geom_smooth(color = palette[0], se = False) # 'lowess CI isn't avialble yet'
    + geom_point(alpha = 0.2, color=palette[0])
    + labs(
        title="Success Rate by Distance - Observed Data",
        x="Attempt Distance (yards)",
        y="Success Rate",
    )
    + theme_minimal()
)

Features

As expected, an increase in distance causes the success rate to drop. After roughly 45 yards, success rates decline at a steeper rate. Using splines in the model will allow us to fit this non-linear curve and estimate the relationship more realistically than a quadratic model.

Intrinsically captured in the data is the fact that players have varying skill levels, something traditional analytics cannot fully account for. The sparse data and hierarchical structure of the underlying process make this well suited for hierarchical modeling methods.

I will begin by building a simple but effective model, incorporating both kick distance (via splines) and individual player effects as primary features.

Model

Model Definition

With the individual observed kicks, the scope of this analysis will be to estimate an individual player’s probability of success. The following model was used:

\[ \begin{aligned} y_i &\sim \operatorname{Bernoulli}(p_i) \\ \operatorname{logit}(p_i) &= \alpha + \theta_{j[i]} + \sum_{k=1}^K B_k(x_i) \, \beta_k \\ \alpha &\sim \operatorname{Normal}(1, 1) \\ % Global fixed intercept \theta_j &\sim \operatorname{Normal}(0, \sigma) \\ \sigma &\sim \operatorname{HalfNormal}(0.5) \\ \beta_k &\sim \operatorname{Student-t}(3, 0, 1) \quad \text{for } k = 1, \dots, K \\ \end{aligned} \]

Overview: - Bernoulli likelihood for the binary outcome - Partially pooled intercepts by player - The distribution of player effects share the same sigma parameter - Spline with basis functions applied to distance

This model generates player-specific intercepts that adapt based on individual performance and data availability. Players with more observations will have their estimates pulled toward their own performance, while those with limited data will remain closer to the population prior. The shared variance across all players creates partial pooling, where information is borrowed between players to improve estimates for those with limited data.

The model outlined above is used, but implemented using the non-centered parameterization technique. This approach is mathematically equivalent to the centered formulation, but it allows for sampling geometries that are less correlated. This typically improves convergence and sampling efficiency in hierarchical models.

The key difference lies in how the player’s offset from the mean, \(\tilde{\theta}_j\), and the group-level standard deviation, \(\sigma\), are sampled independently. This decouples their dependency, resulting in a posterior geometry that is easier for the MCMC algorithms to explore.

\[ \begin{aligned} y_i &\sim \operatorname{Bernoulli}(p_i) \\ \operatorname{logit}(p_i) &= \alpha + \theta_{j[i]} + \sum_{k=1}^K B_k(x_i) \, \beta_k \\ \alpha &\sim \operatorname{Normal}(1, 1) \\ % Global fixed intercept \theta_j &= \tilde{\theta}_j \cdot \sigma \\ \tilde{\theta}_j &\sim \operatorname{Normal}(0, 1) \\ \sigma &\sim \operatorname{HalfNormal}(0.5) \\ \beta_k &\sim \operatorname{Student\text{-}t}(3, 0, 1) \quad \text{for } k = 1, \dots, K \\ \end{aligned} \]

Prior Justification

One of the strengths of Bayesian modeling is the ability to incorporate domain knowledge through prior distributions. For instance, we expect field goals from typical distances — such as 30 yards — to be successful more often than not. To reflect this, I placed a prior on the intercept of \(Normal(1, 1)\), which corresponds to a probability greater than 50% (with the use of the logit link function).

Player effects are modeled with a normal distribution, a common choice in hierarchical models - particularly with non-centered parameterization. This prior is skeptical of extreme outliers and assumes the average player is roughly replacement level.

For the spline terms, I used a Student-t prior. This allows for heavier tails, enabling the model to capture more extreme (but plausible) effects of distance on field goal success. A minimal number of knots was used in the spline to provide flexibility while reducing the risk of overfitting.

Model Configuration

data = make_pymc_data(df)

COORDS = {
    "splines": data["spline_shape"],
    "player_id": data["player_cat"],
}

with pm.Model(coords=COORDS) as model:
    # -------------- DATA -------------------------------
    spline_data = pm.Data("spline_data", data["spline_array"])
    player_idx = pm.Data("player_idx", data["player"])

    # ------------- PARAMETERS ---------------------------
    # globa intercept
    alpha = pm.Normal("alpha", mu=1, sigma=1)

    # splines
    w = pm.StudentT("w", nu=3, mu=0, sigma=1, dims="splines")
    spline_contrib = pm.math.dot(spline_data, w.T)

    # player effects
    player_offset = pm.Normal("player_offset", mu=0, sigma=1, dims="player_id")
    sigma = pm.HalfNormal("sigma", sigma=0.5)
    player_effect = pm.Deterministic("1|player_id", player_offset * sigma)

    # ------------- MODEL --------------------------------
    mu = alpha + player_effect[player_idx] + spline_contrib
    p = pm.Deterministic("p", pm.math.invlogit(mu))

    # likelihood
    pm.Bernoulli("y_rep", p=p, observed=df["success"].to_numpy())

    # -------------- GENERATED QUANTITIES ----------------
    mu_baseline = alpha + spline_contrib
    p_baseline = pm.Deterministic("p_baseline", pm.math.invlogit(mu_baseline))

    # -------------- PRIOR PREDICTIVE --------------------
    idata = pm.sample_prior_predictive()

pm.model_to_graphviz(model)
Sampling: [alpha, player_offset, sigma, w, y_rep]

The model graph above shows how each component contributes to the model. The spline_data and player_idx filled blocks represent data elements. The unfilled ovals, which are parameters, feed into various deterministic nodes, such as 1|player_id, p_baseline or p. Because this is a generative model, it can simulate different quantities given different input data.

The parameter p represents the probability of a successful field goal and is used in downstream metrics. Once the model is fit and the parameters are estimated, the spline_data, spline_weights, and global intercept alpha combine to produce a probability baseline, p_baseline, which will later be used to assess player effects.

Prior Predictive Checks

After specifying the model structure and priors, a diagnostic tool in prior predictive checks help assess whether the model generates realistic values. In this case, since field goals are made more often than missed, the prior predictive distribution should reflect more success.

Code
df_prior = pl.from_pandas(
    idata.prior_predictive["y_rep"].to_dataframe().reset_index()
).rename({"y_rep": "success"})

(
    ggplot(aes("success", "len"))
    + geom_col(
        aes(group="draw", color="t"),
        position="identity",
        fill=None,
        alpha=0.01,
        data=(
            df_prior.group_by(["draw", "success"])
            .len()
            .with_columns(t=pl.lit("prior predictive"))
        ),
    )
    + geom_col(
        aes(color="t"),
        fill=None,
        size=1.5,
        data=(
            df_prior.group_by(["draw", "success"])
            .len()
            .group_by("success")
            .agg(pl.col("len").mean())
            .with_columns(t=pl.lit("prior predictive mean"))
        ),
    )
    + theme_minimal()
    + labs(title="Prior Predictive Checks", x="Success", y="Count", color="")
    + scale_x_continuous(breaks=[0, 1])
    + theme(
        panel_grid_minor=element_blank(),
        panel_grid_major_x=element_blank(),
        legend_position="top",
        legend_background=element_blank(),
    )
    + scale_color_manual(values=(palette[2], palette[-1]))
)

The results from the prior predictive check suggest that the priors lead to outcomes consistent with the expectation, indicating that the model has a plausible data-generating process.

Model Fit

The model is fit by sampling from the posterior distribution of the parameters. The plot below visualizes these posterior distributions along with rank bars for each chain. A roughly uniform rank distribution indicates good mixing and suggests that the chains have converged.

if Path("models/idata.nc").exists():
    idata = az.from_netcdf("models/idata.nc")

else:
    with model:
        idata.extend(pm.sample(cores=1, chains=4, random_seed=527))
        az.to_netcdf(idata, "models/idata.nc")

params = ["alpha", "w", "1|player_id", "sigma"]

_ = az.plot_trace(idata, var_names=params, kind="rank_bars")
plt.show()

Diagnostics

The Markov Chain Monte Carlo (MCMC) algorithm used to fit the model comes with its own set of diagnostic criteria. While these diagnostics don’t address causal assumptions or domain-specific model validity, they ensure that the MCMC algorithm is functioning properly.

az.summary(idata, var_names=params)
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha 4.624 0.298 4.057 5.171 0.007 0.005 1881.0 2145.0 1.0
w[0] -3.745 0.775 -5.234 -2.342 0.018 0.013 1775.0 1834.0 1.0
w[1] -2.794 0.464 -3.646 -1.936 0.009 0.007 2522.0 2439.0 1.0
w[2] -7.392 0.820 -8.863 -5.784 0.020 0.014 1678.0 1894.0 1.0
1|player_id[0] -0.091 0.213 -0.527 0.288 0.003 0.004 6586.0 3040.0 1.0
... ... ... ... ... ... ... ... ... ...
1|player_id[133] -0.157 0.249 -0.630 0.289 0.004 0.005 4407.0 2441.0 1.0
1|player_id[134] -0.059 0.236 -0.526 0.368 0.003 0.004 4789.0 2922.0 1.0
1|player_id[135] -0.044 0.261 -0.572 0.411 0.003 0.005 6609.0 3017.0 1.0
1|player_id[136] -0.001 0.256 -0.477 0.481 0.003 0.005 5534.0 2703.0 1.0
sigma 0.249 0.051 0.153 0.345 0.001 0.001 1187.0 2052.0 1.0

142 rows × 9 columns

The following table summarizes these diagnostics across the model parameters. The distribution of bulk and tail ESS values, along with \(\hat{R}\), to ensure all parameters were sampled adequately and that the posterior inferences are trustworthy.

  • Bulk ESS (Effective Sample Size) estimates how many independent samples are effectively drawn from the bulk of the posterior distribution. Higher values indicate more reliable estimates.
  • Tail ESS focuses on the tails of the distribution, which are important for accurately characterizing uncertainty.
  • \(\hat{R}\) (R-hat) is a convergence diagnostic comparing variances in chains. Values close to 1 (typically below 1.01) indicate that the chains have mixed well and converged.
az.summary(idata, var_names=params)[["ess_bulk", "ess_tail", "r_hat"]].describe(
    percentiles=[0.05, 0.5, 0.95]
).round()
ess_bulk ess_tail r_hat
count 142.0 142.0 142.0
mean 5331.0 2836.0 1.0
std 1043.0 267.0 0.0
min 1187.0 1834.0 1.0
5% 3056.0 2315.0 1.0
50% 5540.0 2864.0 1.0
95% 6564.0 3161.0 1.0
max 7954.0 3303.0 1.0

In this case, all chains converged without divergent transitions, and the effective sample sizes are sufficient.

Assesing Fit

The plot below shows that the model produces predictions consistent with the observed data, suggesting that the model’s assumptions and structure are appropriate for summarizing performance.

with model:
    df_ppd = pm.sample_posterior_predictive(idata, var_names=['y_rep'], random_seed=123)
Sampling: [y_rep]

Code
df_ppd = pl.from_pandas(
    df_ppd.posterior_predictive.sel(draw=slice(0, 100))
    .stack(sample=["chain", "draw"])
    .to_dataframe()["y_rep"]
    .reset_index()
).rename({"y_rep": "success"})

(
    ggplot(aes("success", "len"))
    + geom_col(
        aes(group="draw", color="t"),
        position="identity",
        fill=None,
        alpha=0.01,
        data=(
            df_ppd.group_by(["chain", "draw", "success"])
            .len()
            .with_columns(t=pl.lit("Posterior predictive"))
        ),
    )
    + geom_col(
        aes(color="t"),
        fill=None,
        size=1.5,
        data=(df.group_by(["success"]).len().with_columns(t=pl.lit("Observed"))),
        alpha=0.02,
    )
    + theme_minimal()
    + labs(title="Posterior Predictive Checks", x="Success", y="Count", color="")
    + scale_x_continuous(breaks=[0, 1])
    + theme(
        panel_grid_minor=element_blank(),
        panel_grid_major_x=element_blank(),
        legend_position='top',
        legend_background=element_blank(),
    )
    + scale_color_manual(values=(palette[2], palette[-1]))
)

Posterior Distributions

Player Effects

The table below summarizes the posterior distributions for each player’s individual effect on the baseline probability of success, after controlling for kick distance.

Code
player_index(df).join(
    pl.from_pandas(az.summary(idata, var_names="1|player_id").reset_index()).with_columns(
        player_idx=pl.col("index").str.extract(r"player_id\[(\d+)\]").cast(pl.Int64)
    ),
    on="player_idx",
).select(pl.exclude(["player_idx", "index"])).with_columns(pl.col('player_name').cast(pl.Utf8).str.to_titlecase())
shape: (137, 11)
player_id player_name mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
cat str f64 f64 f64 f64 f64 f64 f64 f64 f64
"34623" "David Buehler" -0.091 0.213 -0.527 0.288 0.003 0.004 6586.0 3040.0 1.0
"10114" "John Kasay" -0.056 0.197 -0.415 0.317 0.002 0.003 6785.0 3133.0 1.0
"33337" "Garrett Hartley" -0.11 0.181 -0.438 0.242 0.002 0.003 5337.0 2634.0 1.0
"26127" "Shayne Graham" 0.031 0.181 -0.288 0.381 0.002 0.003 6444.0 2864.0 1.0
"30932" "Stephen Gostkowski" 0.263 0.147 -0.018 0.533 0.002 0.003 4254.0 3027.0 1.0
"46832" "Ryan Santoso" 0.018 0.249 -0.424 0.519 0.004 0.004 4875.0 2856.0 1.0
"46236" "Daniel Carlson" -0.157 0.249 -0.63 0.289 0.004 0.005 4407.0 2441.0 1.0
"46663" "Matthew Mccrane" -0.059 0.236 -0.526 0.368 0.003 0.004 4789.0 2922.0 1.0
"46691" "Tyler Davis" -0.044 0.261 -0.572 0.411 0.003 0.005 6609.0 3017.0 1.0
"38815" "Johnny Hekker" -0.001 0.256 -0.477 0.481 0.003 0.005 5534.0 2703.0 1.0

While these individual player effects help understand the model’s insights into each kicker’s baseline skill, their direct interpretation as raw coefficients can be challenging to communicate to non-technical stakeholders a standalone evaluation metric. This is why I will move towards the Simulated Field Goal Over Expected (FGOE), which translates these effects into a more relatable and actionable rating of expected performance on a standardized set of kicks.

Spline

The table below shows the summary of spline coefficient distributions.

az.summary(idata, var_names=["alpha", 'w'])
mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
alpha 4.624 0.298 4.057 5.171 0.007 0.005 1881.0 2145.0 1.0
w[0] -3.745 0.775 -5.234 -2.342 0.018 0.013 1775.0 1834.0 1.0
w[1] -2.794 0.464 -3.646 -1.936 0.009 0.007 2522.0 2439.0 1.0
w[2] -7.392 0.820 -8.863 -5.784 0.020 0.014 1678.0 1894.0 1.0

Unlike player effects, they are not as easily interpretable in a table. The plots below are to help build an understanding of how they come together.

Basis functions

There are 3 basis functions used in the model. Below is a plot of each basis function and it’s new shape after learning parameter weights.

Code
def create_spline_table() -> pl.DataFrame:
    """
    Takes the parameter means from the posterior and applies them to
    the spline basis dataframe. This allows for plotting the estimated
    spline functions.

    Returns:
        pl.DataFrame: long spline bf dataframe
    """
    w = idata.posterior["w"].mean(["chain", "draw"]).values

    return (
        pl.DataFrame(data["spline_basis"].values * w.T)
        .with_columns(distance=df["distance"].to_numpy())
        .unpivot(cs.starts_with("col"), index="distance")
        .with_columns(pl.col("variable").str.extract(r"(\d+)"))
    )


def create_spline_table_with_hdi(hdi_prob:float=0.95) -> pl.DataFrame:
    """
    Creates a dataframe with the high density intervals for the aggregated
    spline function.

    Returns:
        pl.DataFrame
    """

    # Get HDI bounds
    hdi_bounds = az.hdi(idata.posterior["w"], hdi_prob=hdi_prob)
    w_lower = hdi_bounds["w"].sel(hdi="lower").values
    w_upper = hdi_bounds["w"].sel(hdi="higher").values

    # Create dataframes for lower and upper bounds
    df_lower = (
        pl.DataFrame(data["spline_basis"].values * w_lower.T)
        .with_columns(distance=df["distance"].to_numpy())
        .unpivot(cs.starts_with("col"), index="distance")
        .with_columns(pl.col("variable").str.extract(r"(\d+)"))
        .group_by("distance")
        .agg(pl.col("value").mean().alias("lower"))
    )

    df_upper = (
        pl.DataFrame(data["spline_basis"].values * w_upper.T)
        .with_columns(distance=df["distance"].to_numpy())
        .unpivot(cs.starts_with("col"), index="distance")
        .with_columns(pl.col("variable").str.extract(r"(\d+)"))
        .group_by("distance")
        .agg(pl.col("value").mean().alias("upper"))
    )

    return df_lower.join(df_upper, on="distance")


df_spline = create_spline_table()
(
    ggplot(df_spline, aes("distance", "value", color="factor(variable)"))
    + geom_line(size=1.2)
    + labs(title="Basis Function Shape", color="bf")
    + theme_minimal()
    + theme(legend_position="top")
    + labs(x="Distance", y="Value")
    # + scale_color_manual(values=[palette[0], palette[1], palette[-3], palette[-2]])
)

Aggregated spline

Finally, the individual spline components can be aggregated to estimate the overall effect of distance on field goal success.

Code
df_spline_agg = (
    df_spline.group_by("distance")
    .agg(pl.col("value").mean())
    .join(create_spline_table_with_hdi(), on="distance")
)

(
    ggplot(
        df_spline_agg,
        aes("distance", "value"),
    )
    + geom_line(aes("distance", "value"), size=1.2, color=palette[0])
    + geom_ribbon(aes(ymin="lower", ymax="upper"), alpha = 0.2)
    + labs(title="General Spline Shape")
    + theme_minimal()
    + labs(x="Distance", y="Value")
)

Bayesian modeling allows us to quantify the uncertainty in the relationship between distance and the probability of a successful field goal. At shorter distances, the effect is more stable and tightly estimated. However, at longer distances there is noticeably more uncertainty in the predicted success rate.

Field Goals Over Expected

Traditional FGOE

What I outlined in my application was establishing at minimum a Field Goal Over Expected model. This contextualizes each kick and helps identify players over or underperforming relative to expectation. This metric is used across multiple sports because it’s intuitive and easily digestible, especially for audiences with less analytical training. It does a better job than raw completion rate, but has room for improvement.

2018 results

To calculate a traditional FGOE, the average expected probabilities accounting for distance and removing the player grouping effects can be used. FGOE is simply the observed event minus those expected probabilites. This can be done with any point estimate, but with a Bayesian methods we again have a distribution of values to use.

Utility Functions:
Code
def add_avg_draws(
    df: pl.DataFrame, data: dict = data, model: pm.Model = model
) -> pl.DataFrame:
    """
    Get the predicted probabilities on distance without player effects. Similar to
    `bambi.Model.predict(..., infer_group_specific=False)` or
    `tidybayes::add_epred_draws(..., re_formula = NA`)`.

    These are used to calculate the FGOE metrics.

    """
    design_info = data["spline_basis"].design_info
    spline_array = data["spline_basis"].values

    _spline_array = dmatrix(design_info, df.to_pandas(), return_type="dataframe").values

    with model:
        pm.set_data(
            {
                "player_idx": np.zeros_like(
                    df["player_id"].to_numpy(), dtype=np.int64
                ),  # or exclude entirely
                "spline_data": _spline_array,
            }
        )
        pred_baseline = pm.sample_posterior_predictive(idata, var_names=["p_baseline"])

    df_pred_baseline = pl.from_pandas(
        pred_baseline.posterior_predictive["p_baseline"].to_dataframe().reset_index()
    ).rename({"p_baseline_dim_0": "__obs__"})

    return df_pred_baseline


def add_epred_draws(
    df: pl.DataFrame, var_name: str, data: dict = data, model: pm.Model = model
) -> pl.DataFrame:
    """
    Get the predicted probabilities on distance with player effects. Similar to
    `bambi.Model.predict(..., infer_group_specific=True)` or
    `tidybayes::add_epred_draws(..., re_formula = NA`)`.

    These are used to calculate the FGOE metrics.

    Returns:
        pl.DataFrame
    """
    design_info = data["spline_basis"].design_info
    spline_array = data["spline_basis"].values

    _spline_array = dmatrix(design_info, df.to_pandas(), return_type="dataframe").values

    with model:
        pm.set_data(
            {
                "player_idx": df["player_id"].to_physical().cast(pl.Int64).to_numpy(),
                "spline_data": _spline_array,
            }
        )
        df_y = pm.sample_posterior_predictive(
            idata, var_names=[var_name], extend_inferencedata=False
        )

    _df = pl.from_pandas(
        df_y.posterior_predictive[var_name].to_dataframe().reset_index()
    )

    _df.columns = ["chain", "draw", "__obs__", var_name]

    return _df

def create_fgoe_table(df: pl.DataFrame) -> pl.DataFrame:
    """
    Take the player, distance dataframe and calculate FGOE by
    subtracting the probability without player effect from the probability
    with player effects. This gives the foundation of ways to evaluate kickers.

    Returns:
        pl.DataFrame
    """
    _df_epred_baseline = add_avg_draws(df)
    _df_epred = add_epred_draws(df, "p")

    cols = [
        x
        for x in ["player_name", "player_id", "distance", "success"]
        if x in df.columns
    ]
    df_fgoe = (
        df.select(cols)
        .with_row_index("__obs__")
        .join(_df_epred_baseline, on="__obs__")
        .join(_df_epred, on=["chain", "draw", "__obs__"])
        .with_columns(fgoe=pl.col("p") - pl.col("p_baseline"))
    )

    return df_fgoe
2018 observed data

For each kick in 2018, we can aggregate each players traditional FGOE to evaluate their performance during the season.

df_2018 = df.filter(pl.col("season") == 2018)
df_fgoe = create_fgoe_table(df_2018)

df_tfgoe = (
    df_fgoe.with_columns(fgoe=pl.col("success") - pl.col("p_baseline"))
    .group_by(["chain", "draw", "player_name", "player_id"])
    .agg(pl.col("fgoe").sum())
)

ggpointinterval(
    df_tfgoe,
    title="Traditional FGOE - 2018",
    axis_label="Traditional FGOE",
)


The problem with this metric? In 2018 Justin Tucker is represented as a replacement-level kicker - or worse. Yet historically, Tucker had been one of the most reliable kickers in the league going into the 2018 season. Pitching this metric and leaderboard during the season would have been difficult to justify, and rightfully so.

2016-2017 results

Looking at the previous 2 seasons, Tucker’s Traditional FGOE metric stood out amoung the rest.

Code
df_before = df.filter(pl.col("season").is_in([2016, 2018]))

df_tfgoe_before = (
    create_fgoe_table(df_before)
    .with_columns(fgoe=pl.col("success") - pl.col("p_baseline"))
    .group_by(["chain", "draw", "player_name", "player_id"])
    .agg(pl.col("fgoe").sum())
)

ggpointinterval(
    df_tfgoe_before,
    title="Traditional FGOE by Player - 2016-2017",
    axis_label="Traditional FGOE",
    axis_text_size=8,
)


It isn’t impossible, but it should be highly unlikely that he dropped from an outlier lead, to one of the worst in 6 weeks.

The problem with using this metric alone is that limited sample sizes can introduce a significant amount noise. Upon examining the dataset, there were multiple blocked field goals against Tucker. While outside the scope of this analysis, it’s uncertain how much credit should be assigned to the kicker versus the blocking on the play for blocked kicks.

2018 results - without blocks

When blocked kicks are removed, the model results align a little better with expectations.

Code
df_2018 = df.filter(pl.col("season") == 2018).filter(
    pl.col("field_goal_result") != "Blocked"
)

df_tfgoe_wout_blocks = (
    create_fgoe_table(df_2018)
    .with_columns(fgoe=pl.col("success") - pl.col("p_baseline"))
    .group_by(["chain", "draw", "player_name", "player_id"])
    .agg(pl.col("fgoe").sum())
)

ggpointinterval(
    df_tfgoe_wout_blocks,
    title="Traditional FGOE - 2018 | Blocks Removed",
    axis_label="Traditional FGOE",
)


However, this approach is arbitrary as it requires manual intervention that doesn’t provide a way to account for such events within the model itself, highlighting the need for a better solution.

Another issue to consider are the fact that established veterans like Justin Tucker and Mason Crosby show some of the widest posterior distributions. Players with minimal experience like Tyler Davis and Ross Martin (each with 1 FG attempt) have tight estimates. This is counterintuitive from a Bayesian perspective. If a player has limited data, we should have less confidence in estimating their true ability. But with more data, we expect the model to move away from the prior and narrow in on a more certain estimate.


Simulated FGOE

The model allows us to extract each player’s parameter distributions as well as distance effects, and apply it to simulated data. This enables what-if scenarios and projections beyond only the observed data. We can establish expectations for what each kicker could make in this hypothetical world where they each attempt the same set of kicks.

Standardized kick predictions

I chose a linear interval of kicks from 25-60 yards. This decision is arbitrary, but prioritizes simplicity and represents the first step toward gaining organizational buy-in for the leaderboard.

df_grid = (
    df.filter(pl.col("season") == 2018)
    .select(["player_id", "player_name"])
    .unique()
    .join(
        pl.DataFrame({"distance": [x for x in range(25, 61)]}).with_columns(
            distance_std=(pl.col("distance") - pl.col("distance").mean())
            / pl.col("distance").std()
        ),
        how="cross",
    )
)

df_grid.drop(['distance_std'])
shape: (1_836, 3)
player_id player_name distance
cat cat i64
"38691" "RANDY BULLOCK" 25
"38691" "RANDY BULLOCK" 26
"38691" "RANDY BULLOCK" 27
"38691" "RANDY BULLOCK" 28
"38691" "RANDY BULLOCK" 29
"35102" "GRAHAM GANO" 56
"35102" "GRAHAM GANO" 57
"35102" "GRAHAM GANO" 58
"35102" "GRAHAM GANO" 59
"35102" "GRAHAM GANO" 60

With the simulation of kicks, we can now look at Simulated FGOE. This operates at the scale of posterior sample draws, with each player having 4,000 simulations across a distribution of kicks. This method provides the ability to quantify uncertainty within the estimates and examine the range of potential outcomes, while addressing the sample size problem of a traditional FGOE.

Plotting Simulated FGOE

The plot below shows the posterior distributions of simulated FGOE. Compared to the traditional version, this metric is more stable with less noise, and it follows intuitive principles that make interpretation easier. Many players cluster around zero, roughly representing replacement level, while deviations in either direction are easy to interpret.

df_fgoe = create_fgoe_table(df_grid)

df_sfgoe = df_fgoe.group_by(["chain", "draw", "player_name", "player_id"]).agg(
    pl.col("fgoe").sum()
)

ggpointinterval(
    df_sfgoe,
    title="Simulated FGOE - 2018",
    axis_label="Simulated FGOE",
)


For example, Justin Tucker’s median rating falls between 1 and 3 field goals over expected, with a central estimate of approximately 2. Even his low-end outcomes are better than most players median performance. That’s far more intuitive and communicable than saying the player’s log-odds intercept increases by 0.2 to 0.9 after adjusting for distance.

Comparing output

To compare traditional and simulated FGOE, we examine the posterior distributions for each method. Simulated FGOE offers a more consistent and interpretable view of player performance. The distributions include credible intervals that reflect the uncertainty in player evaluation, ultimately making the metric more robust and reliable for assessing performance.

Note

The models aren’t necesarily 1:1 comparisons of the same distribution of kicks, because with the traditional FGOE we are limited to the kicks that the kicker actually attempted. Simulated FGOE is applied counterfactually to the same set of kicks for each kicker. It is more important to look directionally at each model. Easily shown is Justin Tucker is firmly at the top of the list, without having to arbitrarily adjust for blocked kicks.

Code
df_comparison = (
    df_sfgoe.join(
        df_tfgoe, on=["chain", "draw", "player_name", "player_id"], suffix="_t"
    )
    .unpivot(["fgoe", "fgoe_t"], index=["chain", "draw", "player_name", "player_id"])
    .with_columns(
        model = pl.when(pl.col("variable") == "fgoe")
        .then(pl.lit("Simulated"))
        .otherwise(pl.lit("Traditional"))
    )
    .rename({'value':'fgoe'})
)

df_comparison_r = pointrange_data(
    df_comparison, group=["player_id", "player_name", "model"]
).join(
    df_sfgoe.group_by(["player_name", "player_id"]).agg(
        plot_sort=pl.col("fgoe").mean(), simulations=pl.len()
    ),
    on=["player_id", "player_name"],
)
df_comparison_r = title_name(df_comparison_r)

(
    ggplot(
        df_comparison_r, aes("reorder(player_name, plot_sort)", "fgoe", color="model")
    )
    + geom_linerange(
        aes(ymin="hdi_lower", ymax="hdi_upper"), size=1.2, position=position_dodge(0.5)
    )
    + geom_pointrange(
        aes(ymin="ymin", ymax="ymax"),
        shape="o",
        fill="white",
        position=position_dodge(0.75),
    )
    + coord_flip()
    + labs(
        y="Simulated FGOE",
    )
    + theme_minimal()
    + theme(
        legend_position="top",
        axis_title_y=element_blank(),
        figure_size=(6, 8),
        dpi=100,
        panel_grid_minor=element_blank(),
        panel_grid_major_y=element_blank(),
        axis_text_y=element_text(size=8),
    )
    + geom_hline(yintercept=0, linetype="dashed")
    + labs(
        title="Simulated vs Traditional FGOE",
        y="Simulated FGOE",
        color="Model:",
    )
    + scale_color_manual(("#001e42", "#ff4b00"))
)

Which player would we prefer?

With the model parameters and simulation, even if a kicker hasn’t attempted a 60 yard kick - we can provide an estimate for what that might look like. This enables probability curves by distance which can help visualize comparisons between players.

Code
plot_df = df_fgoe.filter(pl.col('player_name').is_in(['JUSTIN TUCKER', 'ROBERTO AGUAYO'])).with_columns(pl.col('player_name').cast(pl.Utf8))

(
    ggplot(aes("distance", "p", color="player_name"))
    + geom_ribbon(
        aes(y="fgoe", ymin="ymin", ymax="ymax", fill="player_name"),
        data=pointrange_data(plot_df, field="p", group=["player_name", "distance"]),
        alpha=0.2,
        outline_type=element_blank(),
        show_legend={"color": False},
    )
    + geom_ribbon(
        aes(y="fgoe", ymin="hdi_lower", ymax="hdi_upper", fill="player_name"),
        # fill="#00000053",
        data=pointrange_data(plot_df, field="p", group=["player_name", "distance"]),
        alpha=0.4,
        outline_type=element_blank(),
        show_legend={"color": False},
    )
    + geom_line(
        data=plot_df.group_by(["player_name", "distance"]).agg(pl.col("p").mean()),
        size=1.5,
        show_legend=False
    )
    + theme_minimal()
    + theme(legend_position="top", panel_grid_minor=element_blank())
    + labs(
        title="Modeled Probability by Distance",
        y="P(Make)",
        x="Distance",
        fill="Player:",
    )
    + scale_x_continuous(limits=(25, 60))
    + scale_color_manual(values=(palette[-1], palette[0]))
    + scale_fill_manual(values=(palette[-1], palette[0]))
)

Leaderboard

The table below represents the generated leaderboard, showcasing the Player Id, Player Name, Rating (Simulated FGOE), and Rank for the top 10 and bottom 10 kickers evaluated in the 2018 season. The Rank column is in ascending order, where a lower rank indicates a better performance. The Top 5 and Top 10 columns indicate the percentage of simulations in which a player ranked among the top 5 and top 10 kickers, offering a probabilistic measure of their high-end outcomes.

Code
def create_leaderboard(df: pl.DataFrame) -> pl.DataFrame:
    """
    Create the leaderboard of FGOE values

    Args:
        df: (pl.DataFrame) dataframe from `create_fgoe_draws()`

    Returns:
        pl.DataFrame: Polars dataframe of leaderboard

    """
    idata_post = df.to_pandas().set_index(["chain", "draw", "player_name"]).to_xarray()

    # perc player was in top 5 of a draw
    df_top_5 = (
        df.with_columns(
            player_name=pl.col("player_name").cast(pl.Utf8),
            rank=pl.col("fgoe").rank("dense", descending=True).over(["draw", "chain"]),
        )
        .group_by(["player_name"])
        .agg(top_5=(pl.col("rank") <= 5).mean())
        .sort("top_5", descending=True)
    )

    df_top_10 = (
        df.with_columns(
            player_name=pl.col("player_name").cast(pl.Utf8),
            rank=pl.col("fgoe").rank("dense", descending=True).over(["draw", "chain"]),
        )
        .group_by(["player_name"])
        .agg(top_10=(pl.col("rank") <= 10).mean())
        .sort("top_10", descending=True)
    )

    # summary table of the metric, with rank added and joined with top 5
    leaderboard = (
        (
            pl.from_pandas(
                az.summary(idata_post, var_names=["fgoe"], hdi_prob=0.9)
                .reset_index()
                .assign(player_name=lambda x: x["index"].str.extract(r"fgoe\[(.+)\]"))
            )
            .join(df_top_5, on="player_name")
            .join(df_top_10, on="player_name")
            .sort(["mean"], descending=True)
            .with_row_index("rank", offset=1)
            .rename({"mean": "rating"})
            .join(
                df[["player_name", "player_id"]]
                .unique()
                .with_columns(pl.col("player_name").cast(pl.Utf8)),
                on="player_name",
            )
        )
        .select(
            [
                "player_id",
                "player_name",
                "rank",
                "rating",
                "sd",
                "hdi_5%",
                "hdi_95%",
                "top_5",
                "top_10",
            ]
        )
        .sort(["rank"])
    )

    return leaderboard


leaderboard = create_leaderboard(df_sfgoe)
leaderboard.write_csv("leaderboard.csv")

(
    GT(
        pl.concat([leaderboard.head(13), leaderboard.tail(13)]).rename(
            {"hdi_5%": "lower", "hdi_95%": "upper"}
        )
    )
    .tab_header(
        title="2018 NFL Kicker Leaderboard", subtitle="Top and bottom 10 players shown"
    )
    .tab_spanner(label="Simulated FGOE", columns=["rating", "sd", "lower", "upper"])
    # .tab_stubhead(label="rank")
    .fmt_number(columns=["rating", "sd", "lower", "upper"], decimals=2)
    .fmt_percent(["top_5", "top_10"], decimals=1)
    .data_color(
        columns=["rating"],
        domain=[leaderboard["rating"].min(), leaderboard["rating"].max()],
        palette=["#535c78", "white", "#ff4b00"],
        na_color="white",
    )
    .data_color(
        columns=["lower"],
        domain=[leaderboard["hdi_5%"].min(), leaderboard["hdi_5%"].max()],
        palette=["#535c78", "white", "#ff4b00"],
        na_color="white",
    )
    .data_color(
        columns=["upper"],
        domain=[leaderboard["hdi_95%"].min(), leaderboard["hdi_95%"].max()],
        palette=["#535c78", "white", "#ff4b00"],
        na_color="white",
    )
    .data_color(
        columns=["top_5", "top_10"],
        domain=[0, 1],
        palette=["white", "#ff4b00"],
        na_color="white",
    )
    .cols_label(
        player_id=html("Player Id"),
        player_name=html("Player Name"),
        rank="Rank",
        rating="Rating",
        sd="Variance",
        lower="Low-end",
        upper="High-end",
        top_5="Top 5",
        top_10 = "Top 10"
    )
)
2018 NFL Kicker Leaderboard
Top and bottom 10 players shown
Player Id Player Name Rank Simulated FGOE Top 5 Top 10
Rating Variance Low-end High-end
39470 JUSTIN TUCKER 1 2.04 0.59 1.04 2.98 86.0% 95.8%
33469 STEVEN HAUSCHKA 2 1.52 0.58 0.57 2.48 51.4% 78.0%
27091 MATT BRYANT 3 1.39 0.61 0.43 2.41 40.0% 68.8%
21213 ADAM VINATIERI 4 1.17 0.57 0.24 2.09 25.6% 53.3%
38134 DAN BAILEY 5 1.15 0.62 0.15 2.15 24.8% 51.7%
41953 CHRIS BOSWELL 6 1.09 0.78 −0.20 2.31 27.4% 48.8%
30932 STEPHEN GOSTKOWSKI 7 1.08 0.58 0.17 2.04 19.1% 46.1%
31446 MATT PRATER 8 0.91 0.58 −0.01 1.90 12.2% 33.7%
45046 HARRISON BUTKER 9 0.86 0.89 −0.58 2.33 21.6% 38.2%
43068 JOSH LAMBO 10 0.78 0.79 −0.50 2.05 16.1% 31.8%
25326 SEBASTIAN JANIKOWSKI 11 0.78 0.61 −0.27 1.73 9.3% 26.9%
38701 GREG ZUERLEIN 12 0.74 0.64 −0.40 1.70 8.4% 26.1%
30403 ROBBIE GOULD 13 0.60 0.61 −0.45 1.56 4.9% 17.3%
38815 JOHNNY HEKKER 39 −0.08 1.15 −1.95 1.75 6.2% 13.2%
44966 JAKE ELLIOTT 40 −0.13 0.93 −1.66 1.37 2.7% 7.4%
43846 KA'IMI FAIRBAIRN 41 −0.21 0.95 −1.70 1.38 2.2% 7.0%
46691 TYLER DAVIS 42 −0.28 1.20 −2.19 1.67 4.8% 10.5%
46409 DAVID MARVIN 43 −0.29 1.18 −2.27 1.49 4.0% 9.4%
43815 ROSS MARTIN 44 −0.32 1.12 −2.03 1.52 3.1% 8.6%
46663 MATTHEW MCCRANE 45 −0.33 1.10 −2.14 1.42 2.7% 7.6%
40114 CALEB STURGIS 46 −0.44 0.75 −1.68 0.78 0.2% 1.5%
45037 ZANE GONZALEZ 47 −0.61 1.07 −2.28 1.20 1.6% 3.9%
44240 SAM FICKEN 48 −0.63 1.17 −2.60 1.17 2.0% 4.9%
46236 DANIEL CARLSON 49 −0.80 1.22 −2.78 1.07 1.4% 3.9%
43704 NICK ROSE 50 −1.00 1.15 −2.81 0.86 0.6% 2.1%
43348 ROBERTO AGUAYO 51 −1.45 1.19 −3.34 0.49 0.1% 0.6%
  • Justin Tucker consistently ranks at the top of the leaderboard, demonstrating a high likelihood of being among the top 5 kickers by this metric. This positioning, even in seasons with a lot of variance (like 2018’s blocked kicks), highlights the model’s ability to provide a more stable and accurate assessment of true skill.
  • Robert Aguayo’s ranking reflects the influence of partial pooling from his prior seasons, contributing to his lower-end position despite a potentially stronger 2018 performance.
  • Tyler Davis with only a single kick in the dataset (which was missed), has a very wide credible interval. This indicates that while the model provides a point estimate, there is substantial uncertainty, suggesting his true skill could credibly range from one of the worst kickers to a top 5 performer. This demonstrates the model’s ability to quantify uncertainty, allowing decision-makers to understand the confidence level in each player’s evaluation, especially for those with sparse data.